This document provides an overview of useful information for newcomers to the EU Tax Observatory.
Author: EU Tax Observatory
Date: June 19, 2025
Coding and data
We first present the main principles for achieving reproducible results; the second part provides practical advice for facilitating collaboration on research projects and ensuring good project monitoring.
Reproducibility is a key step to ensure the credibility of the results. We detail the key steps to follow: transparency, comments in coding scripts, data accessibility, and software versioning.
Transparency
This principle is quite simple: every methodological part, every choice, and every line of code must be accessible to outsiders.
Comments
Coding files can be hard to understand, especially for non-experts in the language at hand. Moreover, data processing can be achieved with different packages, especially in R, which might make code reading difficult for outsiders. For instance, let’s take the following example.
R:
df_data[, u_p := val / sur]
df_data <- df_data[u_p > 120 & u_p < 20000]
df_data[, sur := NULL]
lm(u_p ~ y_sale + h_type + l_mut, data = df_data)
Stata:
gen u_p = val / sur
keep if u_p > 120 & u_p < 20000
drop sur
regress u_p y_sale i.h_type i.l_mut
If you are not an R expert, understanding such a chunk of code might be difficult. Hence, assessing whether there is an issue or a bad methodological choice is highly challenging. Now, let’s look at the commented version of this script.
R:
# we create a new column `u_p` that represents the unitary housing price:
# housing value (`val`) divided by surface (`sur`)
df_data[, u_p := val / sur]

# we now filter observations based on the unitary housing price:
# we keep observations above 120 and under 20,000
df_data <- df_data[u_p > 120 & u_p < 20000]

# we remove the column surface
df_data[, sur := NULL]

# finally, we regress the unitary housing price on multiple variables:
# - y_sale: year of sale
# - h_type: housing type
# - l_mut: last date of mutation
# coefficients are obtained through OLS
lm(u_p ~ y_sale + h_type + l_mut, data = df_data)
Stata:
* Create the new variable u_p
gen u_p = val / sur

* Keep observations where u_p is between 120 and 20000
keep if u_p > 120 & u_p < 20000

* Drop the variable sur
drop sur

* Run the linear regression
regress u_p y_sale i.h_type i.l_mut
Comments are extremely helpful for others to understand your code. As a result, they can more easily spot any coding mistakes or poor methodological choices.
Additionally, consider revisiting an old project six months or a year later. Your coding practices may have evolved, and you might have switched packages, making it more challenging to navigate through your code. This is particularly true for lengthy code files, which can be difficult to comprehend after being away from them for several months!
However, for straightforward code, comments are not mandatory, as they can clutter the reading of the code. Hence, commenting code is a balance between giving a clear explanation of what is performed and keeping the code as clean as possible. Performing code review (see Section 2.6) helps to adjust comments on code.
Data accessibility
The data used for the empirical analysis are key information to provide to ensure consistent and reproducible results. Hence, we need to detail the following information:
Data source: where can outsiders access the data?
Data version: for updated datasets, specify the version used in this project (especially for flow data)
Metadata
Software versioning
Finally, a point that is often omitted in reproducible coding is software and package versioning. Packages and software evolve, which might introduce bugs in your code: for instance, a script written in 2019 might break when run with current packages. Hence, we need to specify the version of the main software (R, Python, or Stata), as well as the versions of the packages used. For R and Stata, we provide the following code to access software and package versions.
R:
# it returns the current version of R
R.version.string
# it returns the current version of the package of interest
packageVersion("data.table")
[1] '1.17.8'
Stata:
* Display Stata version information
version

* Display information about installed packages
ado dir

* To get more detailed information about a specific package, you can use:
* ado describe <package_name>
Here, the current version of R is 4.4.3, whereas the version of the data.table package is 1.17.8. I need to share this information to ensure that the results are fully reproducible, even in 10 years. An alternative is to use a Dockerfile, which provides all relevant software and packages to run your script and replicate your results.
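To sketch the Docker approach, here is a minimal, hypothetical Dockerfile for an R project that pins the versions mentioned above. The rocker/r-ver base image and the remotes package are common choices, and main_script.R is a placeholder entry point:

```dockerfile
# Pin the R version through the base image (rocker/r-ver ships fixed R versions)
FROM rocker/r-ver:4.4.3

# Pin package versions at install time
RUN R -e "install.packages('remotes'); remotes::install_version('data.table', version = '1.17.8')"

# Copy the project and define how to run it
COPY . /project
WORKDIR /project
CMD ["Rscript", "main_script.R"]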
Practical advice
In this section, we provide some practical advice for monitoring and managing large projects with multiple data sources and collaborators. First, we present the use of a general logfile to ease navigating through the code. Second, we explain how to share code and keep track of changes in scripts over time, using tools such as Git and GitHub. Finally, we provide some general advice on organizing a working directory.
The logfile
To monitor projects, and more specifically long-term projects, having a logfile is highly advised. This file registers all modifications made, with information such as the date, the author, and a brief explanation. It helps to track the evolution of the project, the main changes within it, and also individuals’ contributions to the coding part of the research project. Here is a snapshot of the logfile used for an ongoing project:
# To-do
- [ ] Make some descriptive statistics about tenure status
- [ ] Merging and comparison with Orbis database
- [ ] Need to assess the quality of changes within the BO register
- [ ] Launch a new collection for 2025
- [ ] Outcomes about rental eviction, housing maintenance and renters income to compute
- [ ] Access to the TVVI database

# Previous changes

## 25.05.06
- Code for the internal workshop is ok. RL 2db4c55
- Figures are currently working. First push in a GitHub repo in the next weeks. RL 2db4c55
- Code review. RL 2db4c55

## 25.04.20
- New code to have matching rate per percentile. RL 688eea8
- We also account for the individual shares and highlight that the 25% rule is a main limit for transparency. RL 688eea8
- New map for Paris. RL 688eea8
Note
The logfile is most of the time written in a lightweight, open format such as txt or Markdown, so that it is easily readable and writable regardless of the OS.
The logfile might also contain a to-do section. Research projects are not always linear, and writing future steps in this file is useful. Indeed, when we re-open an old project, looking at the logfile gives us a good overview of what was achieved and what remains to be done.
Finally, sharing a logfile between contributors is a key element for coordinating your efforts. A look at the logfile enables all team members involved in the project to see the future steps to be implemented, while easily monitoring previous achievements. The authorship information makes communication easier, especially if there is a misunderstanding in the coding parts.
README file
The README is probably the first file an outsider will open when accessing your research project. In your README file, you must put important information such as:
Title and authorship
The main objective of the research project
Information about how to access the data
The main software used in the project
The license of your code (in our case mostly open licenses, but it depends on the project)
Any explanation that helps individuals understand your repository
Note
The README is always written in an open file format, mainly Markdown.
An example of a README file
# Name of the project

## Overview
Here we discuss the objective of the project, a snapshot of the main conclusion, and potential redirection to the paper.

## Features of the code
List big steps of your code being accessible. For instance:
- Data collection
- Data cleaning
- Filtering of the dataset
- Descriptive statistics
- Econometric analysis

## License
This project is licensed under the MIT License. See the LICENSE file in the repo for details.

## Contact
For any questions or feedback, please contact:
Your Name:
GitHub:
Sharing code and versioning
Whereas comments in coding scripts, data accessibility, and transparency are the key elements for full reproducibility, the remaining issue is how to collaborate and keep the repository consistent between the collaborators of the project. Suppose two members work jointly on the coding part: keeping the scripts up to date for both is challenging. Moreover, suppose you want to reproduce a figure that your script produced six months ago: how can you do so smoothly?
The key is to have a shared repository for all team members. This way, every change is transparent and easily accessible. In addition, you need to keep every version of the coding scripts.
GitHub enables us to achieve both objectives. First, it provides a consistent framework for team members to collaborate on the scripts. In addition, it highlights every change between two versions of the code, which makes tracking other members’ changes easier. Second, it keeps every version of the coding scripts (even the scripts from the very beginning of the project), without the need for _v2, _v3, _v100 suffixes, which are very painful in the long term!
So, how to use GitHub? You do not need to master terminal applications to manage a GitHub project. GitHub Desktop lets you manage such a project smoothly and provides an interface to easily monitor changes over time.
Let’s talk about the main actions for a GitHub project. For setting up a repository, see here. We now have a repository shared by two team members.
Let’s say that Member 1 works on the project and makes some significant changes on their own laptop. For now, the changes made by Member 1 are not shared with Member 2: Member 1 has to commit and push them.
Now, consider that the changes made by Member 1 are committed and pushed. Member 2 wants to work on the project and check the tracked changes: they have to pull the project on their laptop to see them. After that, both repositories are synchronized.
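For readers who prefer the command line, the cycle above can be sketched with plain Git commands. The following self-contained example simulates the shared repository and the two members' laptops with local folders (all names are hypothetical):

```shell
set -e
cd "$(mktemp -d)"
git init --bare remote.git --quiet     # the shared repository (GitHub's role)
git clone remote.git member1 --quiet   # Member 1's laptop
cd member1
git -c user.name=M1 -c user.email=m1@example.org \
    commit --allow-empty -m "initial commit" --quiet
git push --quiet -u origin HEAD        # share the starting point
cd ..
git clone remote.git member2 --quiet   # Member 2's laptop
cd member1
git -c user.name=M1 -c user.email=m1@example.org \
    commit --allow-empty -m "significant change" --quiet
git push --quiet                       # Member 1 shares the new commit
cd ../member2
git pull --quiet                       # Member 2 retrieves it
git log --oneline                      # both commits are now visible
```

GitHub Desktop performs exactly these commit, push, and pull operations behind its buttons.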
In addition, Member 2 can monitor the changes made by Member 1 in more detail, without reviewing all the code. Many tools offer this possibility, and GitHub Desktop highlights changes clearly:
Tracking changes
Note
Here, lines highlighted in green are added by Member 1, whereas lines highlighted in red are removed by Member 1. The stable part of the coding file is uncolored (not applicable here).
In the end, every member can trace back every change in the coding files using the commit number. By doing so, every version of the coding files is kept, so every member can regenerate any figure or table, regardless of the version in which it was produced. To sum up, the project is composed of numerous evolutions that can be identified by their commit number.
Hence, we advise you to put this number in your logfile whenever you commit a significant change to the project. It helps to monitor and follow the evolution of the code.
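The commit number is the short hash of the commit. GitHub Desktop displays it in its history panel; from the command line, it can be retrieved as follows (a self-contained sketch in a throwaway repository, with hypothetical names):

```shell
set -e
cd "$(mktemp -d)"
git init --quiet
git -c user.name=RL -c user.email=rl@example.org \
    commit --allow-empty -m "Code review" --quiet
git log --oneline -1          # short number + message, e.g. "2db4c55 Code review"
git rev-parse --short HEAD    # the short number alone, ready for the logfile
```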
Finally, the GitHub environment is not accessible in the CASD for security reasons. There, the repository is shared between all team members, but there is no built-in system to track changes in code between members. However, it is possible to use Git (the software behind GitHub) locally to monitor code in the same way and benefit from versioning and change highlighting.
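As a sketch, the same versioning and change highlighting can be obtained with purely local Git commands, with no remote server involved (file names are hypothetical):

```shell
set -e
cd "$(mktemp -d)"
mkdir project && cd project           # stand-in for the shared project directory
git init --quiet                      # a purely local repository
echo 'x <- 1' > analysis.R
git add analysis.R
git -c user.name=M1 -c user.email=m1@example.org \
    commit -m "first version" --quiet
echo 'x <- 2' > analysis.R            # a later modification
git diff                              # highlights removed (-) and added (+) lines
```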
Organizing directory and files
Besides the scripts themselves, structuring the coding directory is also a good practice. First, one script that does everything must be avoided. Research projects can be large, including data loading, data management, descriptive statistics, and econometric analysis. One standalone file would be too long to be easily understandable by team members or outsiders.
On the other hand, scattered files are hard to understand. Let’s say you join an ongoing project with multiple files to be handled. How do you know which one to execute first? The order of scripts is a key element for full reproducibility: if you ran the econometric analysis before the filtering step, the results would be dramatically different. Hence, one file must aggregate everything and call each subscript.
We can call this script main_script and structure the code as follows (example from an ongoing project)
R:
################################################################################
# INTREALES Project Code Preamble
################################################################################

# -----------------------------------------------------------------------------
# Project Information
# -----------------------------------------------------------------------------
# Author: Author 1, Author 2, Author 3
# Title: Code of super cool project
# Date: 2025-04-08
# Version: 1.0

# -----------------------------------------------------------------------------
# Load Necessary Libraries
# -----------------------------------------------------------------------------
# load all relevant packages for the analysis
source("init/packages.R")
theme_update(text = element_text(family = "serif"))

# -----------------------------------------------------------------------------
# Additional Setup or Configuration
# -----------------------------------------------------------------------------
output_table <- "output_code/table/"    ## location of table outputs
output_figure <- "output_code/figure/"  ## location of figure outputs
choice_w <- 16  # width of the output graphics (in inches)
choice_h <- 9   # height of the output graphics (in inches)

# -----------------------------------------------------------------------------
# Main Code
# -----------------------------------------------------------------------------
################################################################################

# Loading data -----------------------------------------------------------------
source("code/data/01_loading_data.R")
source("code/data/02_filtering_data.R")

# Descriptive statistics about the topic of interest ---------------------------
source("code/descriptive_statistics/01_summary_stat_sample.R")
source("code/descriptive_statistics/02_stat_observation_interest.R")

# Running an econometric analysis ----------------------------------------------
source("code/econometric_analysis/01_diff_in_diff.R")
source("code/econometric_analysis/02_robustness_checks.R")
source("code/econometric_analysis/03_placebo.R")
Stata:
* ##############################################################################
* INTREALES Project Code Preamble
* ##############################################################################

* -----------------------------------------------------------------------------
* Project Information
* -----------------------------------------------------------------------------
* Author: Author 1, Author 2, Author 3
* Title: Code of super cool project
* Date: 2025-04-08
* Version: 1.0

* -----------------------------------------------------------------------------
* Load Necessary Libraries
* -----------------------------------------------------------------------------
* In Stata, we typically use `ssc install` or `net install` to install packages.
* For example, to install a package, you might use:
* ssc install package_name

* -----------------------------------------------------------------------------
* Additional Setup or Configuration
* -----------------------------------------------------------------------------
* Define output directories
global output_table "output_code/table/"    // location of table outputs
global output_figure "output_code/figure/"  // location of figure outputs

* Define graphics dimensions
global choice_w = 16  // width of the output graphics (in inches)
global choice_h = 9   // height of the output graphics (in inches)

* -----------------------------------------------------------------------------
* Main Code
* -----------------------------------------------------------------------------
* ##############################################################################

* Loading data -----------------------------------------------------------------
do "code/data/01_loading_data.do"
do "code/data/02_filtering_data.do"

* Descriptive statistics about the topic of interest ---------------------------
do "code/descriptive_statistics/01_summary_stat_sample.do"
do "code/descriptive_statistics/02_stat_observation_interest.do"

* Running an econometric analysis ----------------------------------------------
do "code/econometric_analysis/01_diff_in_diff.do"
do "code/econometric_analysis/02_robustness_checks.do"
do "code/econometric_analysis/03_placebo.do"
The main script is quite simple to read, regardless of whether you are an R expert. The objective is to present all needed steps, with comments to understand each one. The script can be decomposed into different sections.
First, we introduce a preamble. It provides the key information about the project, such as the persons involved, the date, and the objective of the project. Second, we load everything we need. This avoids loading a package multiple times, and it is helpful when you need to tell another member which packages to load. You can also easily list all your packages when you need to provide their versions to ensure full reproducibility. Third, we load the data. Then, we run the analysis.
Naming files well is important. When you open a coding directory, you want to understand easily how it is structured. That’s why we advise you to store your subscripts within subdirectories with explicit names such as descriptive_statistics, data, or econometric_analysis. It helps to navigate through the directory and understand the code and the underlying choices (which is, in the end, the main thing). Within each subdirectory, we advise you to number the coding files, to make explicit the order in which these scripts must be executed. In the same fashion, object names within scripts should be as clear as possible so that team members understand what is what (but remember that comments help too!).
Finally, we can sum up the structure of the project as follows in a more general manner.
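For instance, based on the paths used in the main script above, a typical layout could look like this (directory names come from that script; the exact set of files will vary by project):

```
project/
├── main_script.R
├── init/
│   └── packages.R
├── code/
│   ├── data/
│   ├── descriptive_statistics/
│   └── econometric_analysis/
└── output_code/
    ├── figure/
    └── table/
```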
Here, we have two main directories (code and output_code). The second one contains all figures and tables produced by the scripts. Hence, it is easy to search for output figures.
Note
This is just general advice! Of course, it depends on the project, the goal to be achieved, and keeping some flexibility is always a good idea!
Storing data
GitHub is a platform to share and collaborate on code. But what about data? Generally speaking, we should avoid storing any data on it, first and foremost because of the sensitivity of the data.
To store data securely, you can use the PSE NextCloud solution. Each member of the EU Tax Observatory is a member of PSE NextCloud, which offers a cloud solution for storing files, documents, or data. It is similar to common cloud services such as Dropbox, but the servers are located in Paris, within PSE. Hence, it provides a GDPR-compliant way to store and share data between members. You can raise the security level of this repository by adding passwords or time restrictions.
Note
In R, there is a package to link your coding repository with your NextCloud data repo.
Besides the storage aspect, it is important to store data in an easily readable format such as csv or parquet to ensure easy opening for other members or replicators. Proprietary formats such as Excel should be avoided.
Code review
Finally, mistakes in coding files are common. Performing code reviews from time to time is a good practice to ensure that everything runs as planned. For instance, you may add an extra 0 when filtering data, which affects the composition of your sample; leave in place a filter that was only meant to simplify the data in early stages; or comment out parts of the code that are actually useful. Code reviews help to track and correct these potential mistakes. When you are writing a coding file, you do not necessarily have the hindsight to identify all issues (just as when writing a draft, inconsistencies slip in along the way).
A code review is simply reading through the entire code sequentially and ensuring that everything is fine. First, you must check that the code runs without errors; if it does not, fix it. Second, you must track any typos, such as in the filtering thresholds or wrong column assignments, that might affect the results. Third, you may find your code too complex: if you can simplify it through a partial rewrite, you should do it. Even if comments help explain your process, the simpler your code is, the more understandable it is.